An Argument Forwarding Queue Machine for Improved Memory Performance

Abstract

The Argument Forwarding Queue (AFQ) machine architecture combines the control flow paradigm of Von Neumann machines with the ability of dataflow machines to forward the results of operations directly to the instructions that use them. The forwarding capability is used to organize the data operands required by successive instructions into data blocks, such that operands for successive instructions are stored in successive locations. By coupling the fetching of instructions with the fetching of data, we can reduce the delays caused by data cache misses. In addition, AFQ supports non-blocking loads in which the requests issued to memory can be returned in any order, thus providing better tolerance to memory latency. We describe an implementation of AFQ and show how it can be extended into a superscalar machine and incorporate speculative instruction execution. A program is compiled into a combined control-data flow graph (CDFG), which is directly executed by the AFQ machine. With each instruction basic block (IBB) in the CDFG, we associate a data basic block (DBB) in the activation record, such that the operands of instructions are stored in the DBB in the order in which they are used by the instructions in the IBB. The result of an operation is forwarded directly to the position in a DBB that corresponds to an instruction in an IBB that uses the data value. We outline the compilation process for this machine and illustrate how existing program optimization and transformation techniques can be used to improve performance.
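The pairing of an IBB with a DBB described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the machine's actual instruction format: the `Instr` record, the `run_block` function, the opcode names, and the use of slot indices as forwarding targets are all hypothetical, and the model handles a single block with no loads, branches, or speculation.

```python
import operator
from dataclasses import dataclass, field

# Hypothetical opcode table for this sketch only.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

@dataclass
class Instr:
    op: str                                     # opcode name (assumed)
    arity: int                                  # number of operands consumed from the DBB
    dests: list = field(default_factory=list)   # DBB slots that receive the result

def run_block(ibb, dbb):
    """Execute one instruction basic block against its data basic block."""
    dp = 0          # data pointer, advances in step with the instruction stream
    result = None
    for instr in ibb:
        operands = dbb[dp:dp + instr.arity]     # operands sit in successive slots
        if any(v is None for v in operands):
            raise RuntimeError(f"operand not yet forwarded for {instr.op}")
        dp += instr.arity
        result = OPS[instr.op](*operands)
        for slot in instr.dests:                # forward the result to its consumers
            dbb[slot] = result
    return result

# (a + b) * (a - b): the last two DBB slots are filled by forwarding.
a, b = 7, 3
ibb = [Instr("add", 2, [4]), Instr("sub", 2, [5]), Instr("mul", 2)]
dbb = [a, b, a, b, None, None]
value = run_block(ibb, dbb)
print(value)  # (7 + 3) * (7 - 3) = 40
```

Because each instruction reads the next consecutive DBB slots, the data accesses in this model are as sequential as the instruction fetches, which is the property the abstract relies on to reduce data cache misses.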


Related resources

A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure

The IP lookup process is a key bottleneck in routing due to the increase in routing table size, increasing traffic, and migration to IPv6 addresses. IP address lookup involves computing the Longest Prefix Match (LPM), for which existing solutions such as BSD Radix Tries scale poorly when traffic in the router increases or when they are employed for IPv6 address lookups. In this paper, we describe a ...


Parallel algorithms for geometric shortest path problems

The original goal of this project was to investigate and compare the experimental performance and ease of programming of algorithms for geometric shortest path finding using shared memory and message passing programming styles on a shared memory machine. However, due to the extended unavailability of a suitable shared memory machine, this goal was only partially met, though a system suitable fo...


N-Policy for M/G/1 Machine Repair Model with Mixed Standby Components, Degraded Failure and Bernoulli Feedback

In this paper, we study the N-policy for a finite population Bernoulli feedback queueing model for the machine repair problem with degraded failure. The running times of the machines between breakdowns have an exponential distribution. The repair times of the machines are independent and identically distributed random variables. If at any time a machine fails, it is sent to the repairman for repairing,...


Scaling Load-Store Queue

In order to tolerate long-latency instructions, a large load and store queue is necessary to bypass in-flight information to dependent instructions; but as the latency goes up, the size of the load and store queue increases as well, which impacts cycle time, area, and power. Hierarchical designs in [2] and [10] were proposed to alleviate the cycle time problem, but the CAM and search functions r...


Publication date: 1994